15 research outputs found

    Efficient instruction and data caching for high-performance low-power embedded systems

    Although multi-threaded processors can increase the performance of embedded systems with minimal overhead, fetching instructions from multiple threads each cycle also increases the pressure on the instruction cache, potentially harming the performance/consumption ratio. Instruction caches are responsible for a high percentage of the total energy consumption of the chip, which becomes a critical issue for battery-powered embedded devices. A direct way to reduce the energy consumption of the first-level instruction cache is to decrease its size and associativity. However, demanding applications, and especially applications with several threads running together, might suffer a dramatic performance slowdown, or even increase the total energy consumption of the cache hierarchy, due to the extra misses incurred. In this work we introduce iLP-NUCA (Instruction Light Power NUCA), a new instruction cache that replaces the conventional second-level cache (L2) and improves the Energy–Delay of the system. We provide iLP-NUCA with a new tree-based transport network-in-cache that reduces both the cache line service latency and the energy consumption relative to the former LP-NUCA implementation. We modeled both conventional instruction hierarchies and iLP-NUCAs in our cycle-accurate simulation environment. Our experiments show that, running SPEC CPU2006, iLP-NUCA performs better and consumes less energy than a state-of-the-art high-performance conventional cache hierarchy (three cache levels, dedicated L1 and L2, shared L3). Furthermore, iLP-NUCA reaches, on average, the performance of a conventional instruction cache hierarchy implementing a double-sized L1, independently of the number of threads. This translates into Energy–Delay product reductions of 21%, 18%, and 11%, reaching 90%, 95%, and 99% of the ideal performance for 1, 2, and 4 threads, respectively. These results are consistent across the considered application distribution, with the biggest gains in the most demanding applications (those with high instruction cache requirements). Moreover, we increase the performance of applications with several threads without penalizing any of them. The new transport topology reduces the average service latency of cache lines by 8% and the energy consumption of its components by 20%.
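
    For readers unfamiliar with the metric, the sketch below shows how an Energy–Delay product (EDP) comparison like the one reported here is computed. It is a minimal illustration; the energy and delay values are placeholders, not measurements from this work.

```python
# Minimal EDP sketch: lower EDP is better. All values are illustrative.

def edp(energy_joules: float, exec_time_seconds: float) -> float:
    """Energy-Delay product of a run."""
    return energy_joules * exec_time_seconds

baseline = edp(energy_joules=2.0, exec_time_seconds=1.00)  # conventional hierarchy
proposal = edp(energy_joules=1.7, exec_time_seconds=0.93)  # iLP-NUCA-style design

# A 21% EDP reduction, as reported for 1 thread, would correspond to
# proposal / baseline == 0.79.
print(f"EDP reduction: {1.0 - proposal / baseline:.1%}")
```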

    Coherent vs. non-coherent last level on-chip caches: an evaluation of the latency and capacity trade-offs

    The exorbitant energy consumption of today's data centers and the growing concern for the environment are forcing information technology to consider how future data centers can reduce costs while preserving the environment. ARM, in a consortium with Nokia, IMEC, EPFL (École Polytechnique Fédérale de Lausanne), and UCY (University of Cyprus), leads the EuroCloud project, which aims to develop a new generation of low-power, 3D-integrated servers-on-chip for cloud computing services. EuroCloud proposes a very low-power server-on-chip that combines ARM processors, hardware accelerators, and 3D-stacked on-chip DRAM. In this project we studied one of the main components of the EuroCloud chip, the on-chip cache hierarchy, comparing different options for its organization. The configuration of the on-chip cache hierarchy affects the average memory access time and, consequently, the overall performance. The chip we studied consists of two clusters, each containing two processors with their respective private first-level caches and a portion of the second-level cache (here, the second level is the last level of the hierarchy). This last-level cache is therefore physically distributed across the clusters and admits two organizations: shared or private caches. We analyzed both: a Shared organization, in which the two clusters share the last-level cache, aiming to exploit its effective capacity to the fullest, and a Cluster organization, in which the last-level cache is private to each cluster, prioritizing faster access (lower latency) to this level of the hierarchy. Within the Cluster organization, we also studied the possibility of introducing a coherence mechanism at this level. After an extensive survey of the state of the art and of the chip's organization and architecture, we modeled both designs in our simulation platform and simulated representative workloads. We analyzed the results in detail for different cache sizes and concluded that a Cluster organization generally performs better: a Cluster design benefits from a lower access latency while, in most cases, providing the cache capacity needed for good performance. When capacity is more critical than access latency, or for workloads with little locality, the Shared design outperforms the Cluster design. As for coherence mechanisms at this level of the hierarchy, we believe they are unnecessary for the type of server studied and the applications considered. Additionally, we extended the simulation environment and refined the simulation methodology to obtain more accurate results.
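
    A minimal sketch, under invented latencies and hit rates, of the latency/capacity trade-off at the heart of this comparison: average memory access time (AMAT) for a shared last-level cache (higher effective capacity, higher latency) versus a per-cluster private one (lower latency, lower capacity). None of the numbers come from the project's simulations.

```python
# AMAT sketch for the shared vs. cluster LLC organizations (invented numbers).

def amat(l1_hit, l1_lat, llc_hit, llc_lat, mem_lat):
    """Average memory access time in cycles."""
    return l1_lat + (1.0 - l1_hit) * (llc_lat + (1.0 - llc_hit) * mem_lat)

# Shared LLC: better hit rate (more capacity) but slower access.
shared = amat(l1_hit=0.95, l1_lat=2, llc_hit=0.80, llc_lat=30, mem_lat=200)
# Cluster (private) LLC: faster access but lower effective hit rate.
cluster = amat(l1_hit=0.95, l1_lat=2, llc_hit=0.72, llc_lat=18, mem_lat=200)

print(f"shared LLC:  {shared:.2f} cycles")
print(f"cluster LLC: {cluster:.2f} cycles")
```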

    Exploiting Natural On-chip Redundancy for Energy Efficient Memory and Computing

    Power density is currently the primary design constraint across most computing segments and the main performance-limiting factor. For years, industry kept power density constant while increasing frequency and lowering transistor supply (Vdd) and threshold (Vth) voltages. However, Vth scaling has stopped because leakage current grows exponentially as Vth falls. Transistor count and integration density keep doubling every process generation (Moore's Law), but the power budget caps the amount of hardware that can be active at the same time, leading to dark silicon: with each new generation there are more resources available, but we cannot fully exploit their performance potential. In recent years, different research trends have explored how to cope with dark silicon and unlock the energy efficiency of chips, including Near-Threshold voltage Computing (NTC) and approximate computing. NTC aggressively lowers Vdd to values near Vth. This allows a substantial reduction in power, as dynamic power scales quadratically with supply voltage; the resulting power reduction could be used to activate more chip resources and potentially achieve performance improvements. Unfortunately, Vdd scaling is limited by the tight functionality margins of on-chip SRAM transistors: when Vdd scales down to near-threshold values, manufacturing-induced parameter variations affect the functionality of SRAM cells, which eventually become unreliable. A large class of emerging applications, on the other hand, features intrinsic error resilience, tolerating a certain amount of noise. In this context, approximate computing exploits the gap between the level of accuracy required by the application and the level of accuracy delivered by the computation, provided that reducing accuracy translates into an energy gain. However, deciding which instructions and data, and which techniques, are best suited for approximation still poses a major challenge. This dissertation contributes in these two directions. First, it proposes a new approach to mitigate the impact of SRAM failures due to parameter variation, enabling effective operation at ultra-low voltages. We identify two levels of natural on-chip redundancy: cache level and content level. The first arises from the replication of blocks in multi-level cache hierarchies. We exploit this redundancy with a cache management policy that allocates blocks to entries taking into account the nature of the cache entry and the use pattern of the block. This policy obtains performance improvements between 2% and 34% with respect to block disabling, a technique of similar complexity, while incurring no additional storage overhead. The second (content-level redundancy) arises from the redundancy of data in real-world applications. We exploit it by compressing cache blocks so that they fit in partially functional cache entries. At the cost of a slight overhead increase, we obtain performance within 2% of that of a cache built with fault-free cells, even when more than 90% of the cache entries have at least one faulty cell. Then, we analyze how the intrinsic noise tolerance of emerging applications can be exploited to design an approximate Instruction Set Architecture (ISA). Exploiting the ISA redundancy, we explore a set of techniques to approximate the execution of instructions across a set of emerging applications, pointing out the potential of reducing the complexity of the ISA and the trade-offs of the approach. In a proof-of-concept implementation, the ISA is shrunk in two dimensions: Breadth (i.e., simplifying instructions) and Depth (i.e., dropping instructions). This proof of concept shows that energy can be reduced by 20.6% on average, at around 14.9% accuracy loss.
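
    As a rough, hedged illustration of the Breadth and Depth dimensions, the toy below approximates a dot product in two ways: simplifying each multiply (Breadth) and dropping part of the multiply-accumulates (Depth). It is a conceptual sketch only, not the dissertation's ISA mechanism; all function names are invented.

```python
# Toy illustration of ISA "Breadth" (simplify instructions) and "Depth"
# (drop instructions) applied to a dot product over small positive ints.

def dot_exact(xs, ys):
    return sum(x * y for x, y in zip(xs, ys))

def dot_breadth(xs, ys):
    # Breadth: round one operand down to a power of two, so hardware
    # could replace the multiplier with a shift.
    def approx_mul(x, y):
        shift = max(y.bit_length() - 1, 0)
        return x << shift
    return sum(approx_mul(x, y) for x, y in zip(xs, ys))

def dot_depth(xs, ys, keep_every=2):
    # Depth: execute only every k-th multiply-accumulate and rescale.
    kept = [x * y for i, (x, y) in enumerate(zip(xs, ys)) if i % keep_every == 0]
    return keep_every * sum(kept)

xs, ys = list(range(1, 9)), list(range(8, 0, -1))
print(dot_exact(xs, ys), dot_breadth(xs, ys), dot_depth(xs, ys))  # 120 98 120
```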

    Concertina: Squeezing in cache content to operate at near-threshold voltage

    Scaling supply voltage to values near the threshold voltage allows a dramatic decrease in the power consumption of processors; however, the lower the voltage, the higher the sensitivity to process variation and, hence, the lower the reliability. Large SRAM structures, like the last-level cache (LLC), are extremely vulnerable to process variation because they are aggressively sized to satisfy high-density requirements. In this paper we propose Concertina, an LLC designed to enable reliable operation at low voltages with conventional SRAM cells. Based on the observation that for many applications the LLC contains large amounts of null data, Concertina compresses cache blocks so that they can be allocated to cache entries with faulty cells, enabling use of 100 percent of the LLC capacity. To distribute blocks among cache entries, Concertina implements a compression- and fault-aware insertion/replacement policy that reduces the LLC miss rate. Concertina reaches the performance of an ideal system implementing an LLC that does not suffer from parameter variation, with a modest storage overhead. Specifically, performance degrades by less than 2 percent even when using small SRAM cells, which implies over 90 percent of cache entries having defective cells, a notable improvement over previously proposed techniques.
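
    A hedged sketch of the core idea: if enough fixed-size sub-blocks of a line are null, the non-null sub-blocks can be squeezed into an entry whose faulty cells disable some of its sub-entries. The sub-block granularity and the fault-map representation below are assumptions for illustration, not Concertina's actual format.

```python
# Can this 64-byte block be stored in a partially faulty entry?
SUBBLOCK_BYTES = 8  # assumed compression granularity

def fits_in_entry(block: bytes, faulty_subentries: set) -> bool:
    """True if the non-null sub-blocks fit in the functional sub-entries."""
    n_sub = len(block) // SUBBLOCK_BYTES
    nonnull = sum(
        1 for i in range(n_sub)
        if any(block[i * SUBBLOCK_BYTES:(i + 1) * SUBBLOCK_BYTES])
    )
    return nonnull <= n_sub - len(faulty_subentries)

# Mostly-null 64-byte line: only sub-block 3 holds data.
line = bytes(24) + b"\x01\x02\x03\x04\x05\x06\x07\x08" + bytes(32)
print(fits_in_entry(line, faulty_subentries={0, 5}))  # True: 1 non-null <= 6 functional
```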

    Revisiting LP-NUCA Energy Consumption: Cache Access Policies and Adaptive Block Dropping

    Cache working-set adaptation is key as embedded systems move to multiprocessor and Simultaneous Multithreaded (SMT) architectures, because inter-thread pollution harms system performance and battery life. Light-Power NUCA (LP-NUCA) is a working-set-adaptive cache that relies on temporal locality to save energy. This work identifies the sources of energy waste in LP-NUCAs: parallel access to the tag and data arrays of the tiles, and low-locality phases with useless block migration. To counteract both issues, we prove that switching to serial access reduces energy without harming performance, and we propose a machine-learning Adaptive Drop Rate (ADR) controller that minimizes the amount of replacement and migration when locality is low. This work demonstrates that these techniques efficiently adapt the cache drop and access policies to save energy. They reduce LP-NUCA consumption by 22.7% for 1SMT. With inter-thread cache contention in 2SMT, the savings rise to 29%. Versus a conventional organization, energy-delay improves by 20.8% and 25% for 1- and 2SMT benchmarks, and in 65% of the 2SMT mixes the gains are larger than 20%.
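
    The serial-versus-parallel access trade-off can be made concrete with a tiny energy model: a parallel access reads every data way alongside the tags, while a serial access reads the tags first and then only the matching data way. The per-array energies below are assumed values, not figures from the paper.

```python
# Energy per tile access under the two policies (invented values).
E_TAG, E_DATA_WAY, WAYS = 0.05, 0.30, 4  # nJ per array access, associativity

def access_energy(serial: bool, hit: bool) -> float:
    if serial:
        # Tags first; the data array is read only on a hit, and only one way.
        return E_TAG + (E_DATA_WAY if hit else 0.0)
    # Parallel: tags plus all data ways are read regardless of the outcome.
    return E_TAG + WAYS * E_DATA_WAY

for serial in (False, True):
    name = "serial" if serial else "parallel"
    print(f"{name:8s} hit: {access_energy(serial, True):.2f} nJ   "
          f"miss: {access_energy(serial, False):.2f} nJ")
```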

    A fault-tolerant last level cache for CMPs operating at ultra-low voltage

    Voltage scaling to values near the threshold voltage is a promising technique to hold off the many-core power wall. However, as voltage decreases, some SRAM cells are unable to operate reliably and show behavior consistent with a hard fault. Block disabling is a micro-architectural technique that allows low-voltage operation by deactivating faulty cache entries, at the expense of reducing the effective cache capacity. In the case of the last-level cache, this capacity reduction leads to an increase in off-chip memory accesses, diminishing the overall energy benefit of reducing the supply voltage. In this work, we exploit the reuse locality and the intrinsic redundancy of multi-level inclusive hierarchies to enhance the performance of block disabling at negligible cost. The proposed fault-aware last-level cache management policy maps critical blocks, those not present in private caches and with a higher probability of being reused, to active cache entries. Our evaluation shows that this fault-aware management results in up to 37.3% and 54.2% fewer misses per kilo-instruction (MPKI) than block disabling for multiprogrammed and parallel workloads, respectively. This translates into performance enhancements of up to 13% and 34.6% for multiprogrammed and parallel workloads, respectively.
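
    A sketch of how such a fault-aware placement might work in an inclusive hierarchy: a block that also lives in a private cache can tolerate a faulty LLC entry (a valid copy exists upstream), reserving fully functional entries for critical blocks. The data structures and tie-breaking rules below are assumptions, not the paper's exact policy.

```python
# Pick the entry of a set where an incoming block should be placed.

def choose_entry(set_entries, block_in_private: bool) -> int:
    faulty = [i for i, e in enumerate(set_entries) if e["faulty"]]
    healthy = [i for i, e in enumerate(set_entries) if not e["faulty"]]
    if block_in_private and faulty:
        return faulty[0]      # copy exists in a private cache: faulty entry is fine
    # Critical block: use a healthy entry, evicting non-reused blocks first.
    if healthy:
        return min(healthy, key=lambda i: set_entries[i]["reused"])
    return faulty[0]          # degenerate case: every entry is faulty

cache_set = [{"faulty": False, "reused": True},
             {"faulty": True,  "reused": False},
             {"faulty": False, "reused": False}]
print(choose_entry(cache_set, block_in_private=True))   # -> 1 (faulty entry)
print(choose_entry(cache_set, block_in_private=False))  # -> 2 (healthy, not reused)
```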

    Content management in caches operating at low voltage

    The energy efficiency of on-chip caches can be improved by reducing their supply voltage (Vdd). However, Vdd scaling is limited to a voltage Vddmin below which some SRAM (Static Random Access Memory) cells may not operate reliably. Block disabling (BD) is a micro-architectural technique that enables operation at very low voltages by deactivating the entries containing any cell that does not operate reliably, at the cost of reducing the effective cache capacity. It is applied at the last-level cache (LLC), where the potential savings are greatest. However, for some applications, the increased consumption caused by off-chip memory accesses does not compensate for the energy saved in the LLC. This work leverages resources that already exist in multiprocessors, such as the on-chip memory hierarchy and the coherence mechanism, to improve the performance of BD. Specifically, we propose exploiting the natural redundancy of an inclusive cache hierarchy to mitigate the performance loss caused by the reduced LLC capacity. We also propose a new content management policy that is aware of defective cache entries. Using reuse information, the replacement algorithm assigns operational cache entries to the blocks most likely to be referenced. The proposed techniques reduce MPKI by up to 36.4% with respect to block disabling, improving its performance by 2% to 13%.

    Evaluation of long-term performance: the behaviour of buried pipes

    SIGLE. Available from the British Library Document Supply Centre, DSC:7620.3469(99/WM/02/12) / BLDSC. United Kingdom

    Block disabling characterization and improvements in CMPs operating at ultra-low voltages

    No full text
    Power density has become the limiting factor in technology scaling, as the power budget restricts the amount of hardware that can be active at the same time. Reducing supply voltage to ultra-low ranges close to the threshold region promises great energy savings. However, the potential savings of voltage scaling are limited by the correct operation of SRAM cells, which is not guaranteed below Vddmin, the minimum voltage at which cache structures operate reliably. Understanding the effects of operating below Vddmin requires complex modelling, so we introduce an updated probability-of-failure model of SRAM cells at 22nm and explore the reliability impact of lowering the chip supply voltage below Vddmin in shared-memory coherent chip multiprocessors (CMPs) running a variety of parallel workloads. Block disabling is a micro-architectural technique to cope with cache reliability at ultra-low voltages; however, in many cases the savings in the on-chip caches do not compensate for the consumption of the rest of the system, as the consumption increase of the off-chip memory may offset the on-chip gain. We make the case that existing coherence mechanisms can provide the substrate to improve energy savings with block disabling, and we propose two low-complexity techniques. Taking the best of both techniques, we can scale voltage below Vddmin and reduce system energy by up to 39% and system energy-delay by up to 10%. Besides, by lowering CMP consumption in a power-constrained scenario, we could activate offline cores, reaching a potential speedup between 3.7 and 4.4.
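
    As a back-of-envelope illustration of why effective capacity collapses below Vddmin, the snippet below computes the probability that an entire cache entry remains usable when each SRAM cell fails independently. The failure probabilities are illustrative and are not taken from the paper's 22nm model.

```python
# P(entry usable) = (1 - p_fail)^bits shrinks exponentially with entry size.
ENTRY_BITS = 64 * 8  # one 64-byte cache entry

for p_fail in (1e-6, 1e-4, 1e-3, 1e-2):
    p_entry_ok = (1.0 - p_fail) ** ENTRY_BITS
    print(f"p_fail={p_fail:.0e}  ->  usable entries: {p_entry_ok:.2%}")
```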